(54) Network management system with improved node discovery and monitoring



(57) A method and system for monitoring nodes in
a network having at least one network management sta- 
tion and a plurality of nodes. A queue (10) stores polling 
messages for transmission to the nodes where each 
node is indexed by its network address. The network 
management station sends polling messages to the 
nodes in sequence at a predetermined rate controlled 
by a rate control mechanism (12). Polling messages are
sent up to four times to a particular node. The
transmission of these messages is recorded in a table which is
indexed by the network address of each node, and by 
the time of the next scheduled timeout (the time period 
between successive polling messages) associated with 
each node. The network management station deter- 
mines if another polling message should be sent to each 
of the nodes. If the fourth polling message has been sent 
to a particular node, it has been unacknowledged by that 
node and the timeout has expired, then the node is de- 
termined to have failed. 
Description

The present invention relates to a network management
station, and more particularly to a network management
station which reduces the elapsed time in
which a network's topology is discovered and updated. 

Large communication infrastructures, known as in- 
ternets, are composed of wide and local area networks 
and consist of end-systems, intermediate systems and 
media devices. Communication between nodes on the 
networks is governed by communication protocols, such 
as the TCP/IP protocol. The end-systems include main- 
frames, workstations, printers and terminal servers. 
Intermediate systems typically include routers used to 
connect the networks together. The media devices, 
such as bridges, hubs and multiplexors, provide com- 
munication links between different end-systems in the 
network. In each network of an internet, the various end 
systems, intermediate systems and media devices are
typically manufactured by many different vendors, and
managing these multi-vendor networks requires
standardised network management protocols.

Generally, to support the communication network, 
network management personnel want to know what 
nodes are connected to the network, what each node is 
(e.g. a computer, router or printer), the status of each 
node, potential problems with the network, and if possi- 
ble any corrective measures that can be taken when ab- 
normal status, malfunction or other notifiable events are 
detected. 

To assist network management personnel in main- 
taining the operation of the internet, a network manage- 
ment framework was developed to define rules describ- 
ing management information, a set of managed objects 
and a management protocol. One such protocol is the 
simple network management protocol (SNMP). 

Network management systems need to interact with 
existing hardware while minimising the host processor 
time needed to perform network management tasks. In 
network management, the host processor or network 
management station is known as the network manager. 
A network manager is typically an end-system, such as 
a mainframe or workstation, assigned to perform the 
network managing tasks. More than one end-system 
may be used as a network manager. The network man- 
ager is responsible for monitoring the operation of a 
number of end-systems, intermediate systems and me- 
dia devices, which are known as managed nodes. The 
network manager, the corresponding managed nodes 
and the data links there between are known as a subnet. 
Many different tasks are performed by the network man- 
ager. One such task is to initially discover the different 
nodes (e.g. end-systems, routers and media devices) 
connected to the network. After discovery, the network 
manager continuously determines how the network or- 
ganisation has changed. For example, the network 
manager determines what new nodes are connected to 
the network. Another task performed after discovery is
to determine which nodes on the network are operational.
In other words, the network manager determines
which nodes have failed. 

Once the nodes on the network are discovered and 
their status ascertained, the information is stored in a
database and network topology maps of the networks 
and/or subnets can be generated and displayed along 
with the status of the different nodes along the network 
to the network management personnel. Topology maps
assist the personnel in the troubleshooting of network
problems and with the routing of communications along
the networks, especially if nodes have failed. 

Through the discovery process, the network manager
ascertains its internet protocol (IP) address, the
range of IP addresses for the subnet components (i.e.
the subnet mask), a routing table for a default router and
address resolution protocol (ARP) cache tables from
known and previously unknown nodes with SNMP
agents. To ascertain the existence of network nodes, the
discovery process performs configuration polls of
known nodes and retrieves the ARP cache tables from
the known nodes, and the routing tables. The network
manager then verifies the existence of those nodes
listed in these tables that it has not previously recorded in
its database.

Examples of network manager systems are the
Onevision™ network management station produced by
AT&T and the OpenView™ network manager produced
by Hewlett Packard. Currently these systems discover
nodes and verify the existence and status of nodes by
sending to each node an internet control message
protocol (ICMP) poll and waiting for a response. The ICMP
poll is also known as a ping. If no response is received
after a specific period of time, the node is determined to
be non-operational or to have failed. The change in
status of the node is then reflected by the network
management station, for example, by updating the topology map.
Instances may occur when the ping is not received by
the node, or the node is busy performing another task
when the ping is sent. Thus, to verify that a node has
actually failed, the network manager sends out a sequence
of M pings, where M is an arbitrary but preferably fixed
number, such as four. Each successive ping is
transmitted if a corresponding acknowledgement is not received
during an associated scheduled timeout interval.
Preferably, the timeout interval is increased for each
successive ping. The sequence of pings terminates either if one
of the pings is acknowledged, or if no acknowledgement
has been received after the timeout interval associated
with the Mth ping has expired. If no response is received
to the Mth ping, the node is declared to be
non-operational ("down").

To illustrate, the network management station in the
OpenView system sends an ICMP poll (ping) to a node
and waits for a response. If no response is received from
the first ping within ten seconds a second ping is sent
out. If no response is received from the second ping
within twenty seconds a third ping is sent out. If no
response is received from the third ping within forty
seconds a fourth ping is sent out. If no response is received
from the fourth ping within eighty seconds the node is
declared down. The total time from when the first ping
is sent to the determination that the node is down can
take about 2.5 minutes.
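
By way of illustration only, the arithmetic behind the 2.5 minute figure can be reproduced with a short Python sketch. The function name and its parameters are assumptions introduced here; only the ten second first timeout, the doubling of each timeout, and the count of four pings are taken from the example above.

```python
def timeout_schedule(first_timeout_s=10, max_pings=4, factor=2):
    """Return the timeout, in seconds, that follows each successive ping."""
    return [first_timeout_s * factor ** i for i in range(max_pings)]


schedule = timeout_schedule()      # 10 s, 20 s, 40 s and 80 s
total_s = sum(schedule)            # time from first ping to "down"
print(schedule, total_s, total_s / 60)
```

The four timeouts sum to 150 seconds, i.e. the 2.5 minutes stated above.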

To prevent an overflow of pings from occurring dur- 
ing, for example, initial discovery, these current systems 
limit the number of unacknowledged ICMP polls to three 
nodes or less. To limit the number of unacknowledged 
polls, the ICMP polls for each managed node are stored 
in memory (a pending polling queue) of the network 
management station and subsequently transferred to an 
active polling queue capable of queuing only three 
nodes. Thus, in the example of Fig. 1, the queue for
node A is in queue 1, the queue for node B is in queue
2, and the queue for node C is in queue 3. The three
nodes in the active polling queue are then each polled 
with an ICMP poll. As a poll is acknowledged or in the 
event a node is declared down, the queue is cleared and 
the next in line node is placed in the active polling queue. 
A ping is then sent to the next in line node. 

Using the above queueing configuration, if for ex- 
ample three failed nodes are polled in rapid succession, 
the status of other nodes cannot be ascertained for at 
least the next 2.5 minutes, since no more than three 
nodes may have unacknowledged polls concurrently.
Similarly, it may take 5 minutes to diagnose the failure 
of six nodes in succession. It may take 7.5 minutes to 
diagnose the failure of nine nodes. As a result, the dis- 
covery and/or status polling process performed by the 
network management station could be substantially
delayed, thus increasing the elapsed time used by the
network management station to perform network
management tasks. Further, the topology map may be delayed
in being updated, thus increasing the time to diagnose 
the problem with the network. 
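
The delays quoted above follow from simple arithmetic, which may be sketched as follows. This is a simplified model resting on stated assumptions: the failed nodes are polled consecutively, each takes the full 150 seconds to be declared down, and the nodes are processed in groups of at most three. The function name and parameters are hypothetical.

```python
import math


def worst_case_diagnosis_s(failed_nodes, active_slots=3, per_node_s=150):
    """Seconds until the last of `failed_nodes` consecutive failed nodes is
    declared down when at most `active_slots` polls may be outstanding."""
    batches = math.ceil(failed_nodes / active_slots)
    return batches * per_node_s


for n in (3, 6, 9):
    print(n, "failed nodes:", worst_case_diagnosis_s(n) / 60, "minutes")
```

Under these assumptions the model gives 2.5, 5 and 7.5 minutes for three, six and nine failed nodes, matching the figures in the preceding paragraph.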

With the increase in size and use of internets, the 
management of such networks has become increasing- 
ly difficult. The resulting increase in the number of nodes 
increases the possibility of polling several failed nodes 
in sequence. Currently, a failure of multiple nodes would 
cause the discovery procedure to be effectively frozen 
as described above. 

It is an object of the present invention to provide an
alternative technique for verifying the operational status
of network nodes in order to reduce the elapsed time of 
network discovery and the elapsed time of status polling 
and to rapidly provide network configuration updates 
which may be displayed on the topology map and assist 
network management personnel in troubleshooting
failures more rapidly.

According to one aspect of the present invention 
there is provided a method for monitoring nodes in a net- 
work having at least one network management station 
and a plurality of nodes, characterised by the steps of: 
providing a queue of polling messages for transmission 
to the nodes, each polling message being indexed by 
the network address of one of the nodes; sending said
queued polling messages from the network manage- 
ment station to the plurality of nodes at a predetermined 
rate; recording transmission of the polling messages in
a table having a first portion indexed by the network ad- 
dress of each node and a second portion indexed by a 
timeout associated with a polling message count for 
each node having an outstanding status poll; and deter- 
mining for each node if that node has failed after a pre- 
determined number of polling messages have been sent 
to that node. 

According to another aspect of the present inven- 
tion there is provided a system for managing a network 
comprising at least one network management station, 
and a plurality of nodes connected to the network man- 
agement station for data communications there be- 
tween, characterized in that each said network manage- 
ment station includes a queue of polling messages for 
transmission to the nodes, and a poll table having a first 
portion indexed by a network address of each node and
a second portion indexed by a timeout associated with 
a polling message count; and wherein said network 
management station determines for each node if that 
node has failed after a predetermined number of polling 
messages have been sent to that node within an 
elapsed timeout period. 

One embodiment of the present invention will now 
be described by way of example with reference to the 
accompanying drawings in which:- 

Fig. 1 is a block diagram of a known polling queue
for determining the status of nodes;
Fig. 2 is a block diagram of an exemplary network
topology;
Fig. 3 is a block diagram of a status poll transmission
mechanism according to the present invention;
Fig. 4 is a block diagram of a status poll transmission
queue and an unacknowledged poll table according
to the present invention;
Fig. 5 is a block diagram of an unacknowledged poll
table according to the present invention;
Fig. 6 is a block diagram of the exemplary network
topology of Fig. 2, illustrating a failed managed
node and other nodes and links affected by the
failed node; and
Fig. 7 is a flow diagram for the operation of the
network management station during discovery and
status verification.

The present invention provides a network manage- 
ment method and system which improves the discovery 
process and the status monitoring process of current 
network management systems. It should be noted that 
the following description is based on a communication 
network using the TCP/IP communication protocol and 
a network managing framework using the SNMP
protocol. However, the invention is applicable to network
management environments based on other network 
configurations using other types of communication and
management protocols as well. 

As noted above, during node discovery or during 
node status verification, the network manager sends
ICMP polls (pings) to each node identified in, for example,
the ARP cache and any known router tables. The node 
manager then waits for a response from the target node. 
The response may include information, such as the 
node IP address, status information regarding the node, 
the type of node (e.g. computer, router or hub), and the 
number of interfaces in the node. When a response is
received, the node information is stored in an IP
topology database. Typically, the decision to manage a node
is made by network management personnel or the net- 
work management station. 

Preferably, the IP topology database consists of a 
series of tables, each containing specific information re- 
garding the network. For example, a network table con- 
tains information associated with each network in the 
topology. This information may include the type of
network, IP address, subnet mask, and the times
associated with the creation and modification of the network
entry in the table. A segment table contains information
associated with each segment (or subnet) in the network
topology. This information may include the name of the
subnet, number of interfaces connected to the subnet,
and the times associated with the creation and
modification of the subnet entry in the table. A node table
contains information associated with each node in the
network topology. The node information may include, for
example, the IP network manager, a SNMP system de- 
scription, the number of interfaces in the node, and 
times associated with the creation and modification of 
the node entry in the table. The information stored in the 
IP topology database is primarily obtained from the dis- 
covery process, but may also be entered from network 
management personnel. 
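
The table layout described above may be pictured with the following Python sketch. It is a hypothetical arrangement and not the schema of any actual product: all class and field names are assumptions chosen to mirror the items listed in the text, and the `status` field on a node entry is an addition made here so that a change in status can be recorded.

```python
import time
from dataclasses import dataclass, field


@dataclass
class NetworkEntry:
    network_type: str
    ip_address: str
    subnet_mask: str
    created: float = field(default_factory=time.time)
    modified: float = field(default_factory=time.time)


@dataclass
class SegmentEntry:
    name: str
    interface_count: int
    created: float = field(default_factory=time.time)
    modified: float = field(default_factory=time.time)


@dataclass
class NodeEntry:
    snmp_sys_descr: str
    interface_count: int
    status: str = "unknown"      # assumed field, not listed in the text
    created: float = field(default_factory=time.time)
    modified: float = field(default_factory=time.time)


@dataclass
class TopologyDatabase:
    networks: dict = field(default_factory=dict)   # keyed by network address
    segments: dict = field(default_factory=dict)   # keyed by segment name
    nodes: dict = field(default_factory=dict)      # keyed by node IP address

    def set_node_status(self, ip, status):
        """Record a status change and refresh the modification time."""
        node = self.nodes[ip]
        node.status = status
        node.modified = time.time()
```

A discovery process would add one `NodeEntry` per verified node, and status polling would call `set_node_status` when a node is found to be up or down.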

From the IP topology database, an IP topology map
can be created. The IP map is a map of the network
topology which places the discovered nodes in
appropriate subnets, networks, and/or internets depending
upon the level of the topology being mapped. In the
system of the present invention, the IP map is preferably
updated as the status of a node changes. The IP map
displays the different nodes using icons or symbols that
represent the node from, for example, an SNMP MIB
file.

As discussed above, some current network man- 
agement systems limit the number of unacknowledged 
pings to three nodes so as to prevent flooding the network
with pings. 

Referring now to Fig. 1, a block diagram of the
queuing sequence for sending pings to different nodes is
shown. Queues 1, 2 and 3 store the ping count for nodes
A, B and C respectively. The queues are not cleared until
the ping is acknowledged or when the time for each ping
expires, i.e. a timeout occurs for the Mth ping, and the
node is declared to have failed. Thus, a ping cannot be
sent to node D until one of the queues is cleared.

Referring to Fig. 2, a block diagram of an exemplary 
network topology is shown to which the present inven- 
tion may apply. 
The network manager according to the present
invention provides a status poll transmission queue which
speeds the processing of acknowledgements by storing
unacknowledged pings in an ordered data table of
arbitrary size indexed by the IP address of each target node.
The size of the data table may be fixed or it may vary.
To speed the management of timeouts,
unacknowledged pings are also stored in an ordered data table
indexed by the time by which a timeout is scheduled to
occur for a particular ping. Each record in each data
table contains a pointer to a corresponding record in the
other table to facilitate rapid removal of the managed
node from the queue in the event a timeout occurs for
the Mth ping, or upon receipt of an acknowledgement of
the ping, whichever occurs first.
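
One possible way to realise such a doubly indexed table is sketched below in Python. It is not the implementation of the invention. In place of cross-pointers between two ordered tables, this sketch shares a single record object between a dictionary keyed by IP address and a heap ordered by timeout, and marks removed records as dead so that they are skipped when they reach the top of the heap (a lazy-deletion substitute). All names are assumptions.

```python
import heapq
import itertools


class UnackedPollTable:
    def __init__(self):
        self._by_ip = {}              # IP address -> record
        self._by_timeout = []         # heap of (deadline, seq, record)
        self._seq = itertools.count() # tie-breaker so records are never compared

    def record_ping(self, ip, deadline, poll_count):
        """Store an unacknowledged ping, replacing any older record."""
        self.remove(ip)
        rec = {"ip": ip, "deadline": deadline,
               "poll_count": poll_count, "live": True}
        self._by_ip[ip] = rec
        heapq.heappush(self._by_timeout, (deadline, next(self._seq), rec))

    def remove(self, ip):
        """Called on acknowledgement: drop the node via the IP index."""
        rec = self._by_ip.pop(ip, None)
        if rec is not None:
            rec["live"] = False       # the heap entry is discarded lazily
        return rec

    def pop_expired(self, now):
        """Return and remove records whose timeout has passed, oldest first."""
        expired = []
        while self._by_timeout and self._by_timeout[0][0] <= now:
            _, _, rec = heapq.heappop(self._by_timeout)
            if rec["live"]:
                rec["live"] = False
                del self._by_ip[rec["ip"]]
                expired.append(rec)
        return expired

    def __len__(self):
        return len(self._by_ip)
```

An acknowledgement is handled through the IP index and a timeout through the heap. Neither operation scans the whole table, which is the property the paragraph above relies on.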

Referring to Figs. 3-5, a status poll transmission 
mechanism and queue for nodes A-Z are illustrated, 
where A-Z represent the identity of each node assigned 
to the node manager. The status poll transmission 
queue 10 identifies the nodes which are scheduled to 
be polled. The status poll transmission queue 10 stores 
the node identity of the nodes which are awaiting trans- 
mission of a poll, and is preferably a FIFO (first in first 
out) queue or a FCFS (first come first serve) queue. 
However, other types of queues may be utilised, e.g. a
LCFS (last come first serve) queue. A queue might also 
be ordered by some attribute of the objects waiting in it, 
such as priority class or node type. A rate control mech- 
anism 12 controls the rate at which the pings are sent 
on the network to the nodes. As the pings are sent, 
records of the transmission of the pings are stored in an 
unacknowledged poll table (see Figs. 4 and 5). As not- 
ed, the unacknowledged poll table consists of two data 
records (an IP record and a timeout record) that are con- 
figured to allow an arbitrary number of nodes to be 
polled concurrently without receiving an acknowledge- 
ment. This configuration allows many status polls to be 
outstanding (unacknowledged) at one time. The rate 
control mechanism 12 (see Fig. 3) prevents the network 
from being flooded with pings. Combining the utilisation 
of the unacknowledged poll table configuration with the 
rate control mechanism 12 allows the network to be
discovered rapidly even when status polls are
unacknowledged for long periods of time. As seen in Fig. 4, the IP
record is indexed by the IP address of the target nodes,
and the timeout record is indexed by the scheduled time- 
out for the particular ping being transmitted. The timeout 
record also includes a poll count record. The scheduled 
timeout is the time period between successive pings tar- 
geted at a particular node. The poll count record repre- 
sents an arbitrary number of pings that have been sent 
to the target node before the node is determined to have 
failed. The maximum ping count may be set by network
management personnel or, more usually, by the
designer of a network management system. Various factors,
such as the acknowledgement return time and the
probability of packet loss, are considered when determining
the ping count. The acknowledgement return time is the
time it takes for the acknowledgement to be received by 
the network management station. 

The scheduled timeout may be set to a fixed, pre- 
determined period of time between each ping. Prefera- 
bly, the scheduled timeout between pings varies de- 
pending upon the ping count. For example, in a config- 
uration where the ping count is four, the scheduled time- 
out between a first ping and a second ping may be set 
to about ten seconds, the timeout between the second 
ping and a third ping may be set to about twenty sec- 
onds, the timeout between the third ping and the fourth 
ping may be set to about forty seconds, and the time 
between the fourth ping and the declaration of a failed 
node may be set to about eighty seconds. 

Once a prescribed sequence of timeouts has been 
recorded by the network management station, the node 
is declared to have failed and the change in status of 
the network is stored in the IP topology database and 
reflected in the IP map.

Referring to Fig. 6, an exemplary network topology 
map is illustrated wherein the hub and its associated 
managed nodes were determined to have failed to
acknowledge the pings.

During the discovery process the IP addresses of 
new nodes arrive in bulk on retrieved lists (ARP caches),
causing status polling requests (pings) of previously
unknown nodes to be generated in bursts. To prevent the
consequent ping messages from flooding the network,
the system of the present invention regulates the trans- 
mission of the pings. That is, the system of the present 
invention schedules the pings for transmission in rapid 
succession at a controlled rate which may be user spec- 
ified. The controlled rate of ping transmission may be 
dependent upon various factors including, for example, 
the current payload on the network, the current spare 
capacity on the network, and the buffer size in the
portion of the kernel of the network management station's
operating system that supports network activity.
Preferably, the rate is no faster than that at which the kernel
(i.e. the portion of the operating system of the network 
management station that supports process manage- 
ment and some other system functions) can handle ac- 
knowledgements. Alternatively the rate may be auto- 
matically adjusted as the ability of the kernel to handle 
acknowledgements changes. For example, if the spare 
capacity of the network increases, or if the payload on 
the network decreases, the rate at which pings may be 
sent also may be increased. Conversely, if the spare
capacity of the network decreases, or if the payload on 
the network increases, the rate at which pings may be 
sent may also be decreased. 

As noted, to prevent a flood of pings on the network
the pings are scheduled for transmission in rapid
succession at the controlled rate using, for example, the
rate control mechanism. One method for monitoring the
throughput of pings is similar to the "leaky bucket"
monitoring algorithm used to provide a sustained throughput
for the transmission of asynchronous transfer mode
(ATM) cells in an ATM network. A description of the leaky
bucket algorithm can be found in "Bandwidth Management: A
Congestion Control Strategy for Broadband Packet
Networks - Characterizing the Throughput-Burstiness
Filter", by A.E. Eckberg, D.T. Luan and D.M. Lucantoni,
Computer Networks and ISDN Systems 20 (1990) pp.
415-423, which is incorporated herein by reference.
Generally, in the "leaky bucket" algorithm, a set
number of pings are transmitted within a specified time
frame, and pings in excess of this number can be
queued. As noted, the controlled rate can be set by
network management personnel or can be automatically
adjusted by the network management station.
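
A rate control mechanism in the spirit of the "leaky bucket" may be sketched as follows. This is a generic textbook leaky-bucket meter combined with a pending queue, written for illustration. It is not taken from the cited paper, and the class name, parameter names and the explicit `now` argument are assumptions. The bucket level drains at `rate_per_s`, each transmitted ping adds one unit, and pings that would overflow `burst` remain queued.

```python
from collections import deque


class RateControl:
    def __init__(self, rate_per_s, burst):
        self.rate_per_s = rate_per_s   # sustained pings per second
        self.burst = burst             # bucket capacity (largest burst)
        self.level = 0.0               # current bucket fill
        self.last = None               # time of the previous update
        self.pending = deque()         # addresses waiting to be pinged

    def enqueue(self, ip):
        self.pending.append(ip)

    def _leak(self, now):
        """Drain the bucket for the time elapsed since the last update."""
        if self.last is not None:
            drained = (now - self.last) * self.rate_per_s
            self.level = max(0.0, self.level - drained)
        self.last = now

    def release(self, now):
        """Return the addresses that may be pinged at time `now`."""
        self._leak(now)
        sent = []
        while self.pending and self.level + 1 <= self.burst:
            self.level += 1
            sent.append(self.pending.popleft())
        return sent
```

With a rate of two pings per second and a burst of three, five queued addresses would be released as three immediately, one after half a second and one after a full second. Adjusting `rate_per_s` at run time would correspond to the automatic adjustment mentioned above.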

Referring to Fig. 7, a flow diagram of the operation
of the network management station during discovery
and status verification is shown. Initially, in discovery the
network management station receives ARP caches and
router tables from various nodes on the network via a
configuration poll. The ARP caches and routing tables
provide the network management station with, for
example, the IP address of nodes along the network. The
information obtained from the ARP cache and the
routing tables is then stored in an IP topology database. As
noted, the determination to manage the node is made
by the network management station or network
management personnel.

To verify the status of nodes, the IP addresses of
the known nodes are stored in, for example, a status poll
transmission queue (seen in Fig. 3) which identifies the
nodes that are to be polled (step 514). When the network
management station is performing status verification
tasks, pings are sent to the newly discovered nodes and
nodes identified in the status poll transmission queue at
the designated IP addresses (step 516). As discussed
above, the pings are sent in a controlled sequence at a
predetermined rate.

As the pings are sent, the IP address associated
with each polled node is stored in the IP record of an
unacknowledged poll table. Simultaneously, a poll count
record in a timeout record of the unacknowledged poll
table is incremented by one and the timeout becomes
the timeout associated with the new poll count (step
518). Thereafter, the IP address for the node is deleted
from the status poll transmission queue (step 520).
Once the ping is sent and the IP address for the node
is deleted from the queue, the system goes into a sleep
mode with respect to the particular node until the ping
is acknowledged or a corresponding timeout occurs,
whichever occurs first (step 522). For each node in the
newly retrieved ARP cache that is not known to the
network management database, a status poll (ping) is sent
in accordance with step 514 above. If the ping has been
acknowledged, the network management station
preferably deletes the IP record and timeout records in the
unacknowledged poll table (step 524).

If the scheduled timeout for a ping occurs first, the 
network management station retrieves the ping count 
from the ping count record (step 526) and determines if 
the ping count matches the predetermined number of 
counts, i.e. the station determines if the ping count is at
the maximum number (step 528). If the ping count does 
not match the predetermined count number, the IP
address for the node is stored in the status poll
transmission queue (step 514) and a new ping is sent to the same
target node and the network management station re- 
peats the steps, as shown in Fig. 7. 

If at step 528 the ping count does match the prede- 
termined count number, then the node is determined to 
have failed (step 530). Thereafter, the IP topology
database is updated with the change in status of the node.
The record for that node is then removed from the status
poll transmission queue and unacknowledged poll table
(step 532). 

This process can be performed concurrently for 
many nodes, thus reducing the delay until each
managed node is polled and increasing the currency of the
IP topology map. 
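
The flow of Fig. 7 may be illustrated by the following self-contained Python simulation. It is a sketch under simplifying assumptions and not the claimed implementation: time is simulated rather than measured, a responsive node acknowledges instantly, an unresponsive node never answers, and one ping is sent per `ping_interval_s` to stand in for the rate control mechanism. The function and variable names are hypothetical. The step numbers in the comments refer to the description above.

```python
import heapq

TIMEOUTS_S = [10, 20, 40, 80]   # timeout after the 1st..4th ping


def monitor(nodes, responsive, ping_interval_s=1.0):
    """Simulate status polling; return {ip: (status, time_decided)}."""
    queue = list(nodes)         # status poll transmission queue (FIFO)
    poll_count = {}             # IP-indexed part of the unacknowledged table
    timeouts = []               # timeout-indexed part: heap of (deadline, ip)
    result = {}
    now = 0.0
    while queue or timeouts:
        if queue:
            ip = queue.pop(0)                           # steps 516 and 520
            poll_count[ip] = poll_count.get(ip, 0) + 1  # step 518
            if ip in responsive:                        # step 524
                del poll_count[ip]
                result[ip] = ("up", now)
            else:
                deadline = now + TIMEOUTS_S[poll_count[ip] - 1]
                heapq.heappush(timeouts, (deadline, ip))
            now += ping_interval_s                      # controlled rate
        else:
            now = max(now, timeouts[0][0])              # step 522: sleep
        while timeouts and timeouts[0][0] <= now:       # steps 526 and 528
            deadline, ip = heapq.heappop(timeouts)
            if poll_count[ip] == len(TIMEOUTS_S):       # steps 530 and 532
                del poll_count[ip]
                result[ip] = ("down", deadline)
            else:
                queue.append(ip)                        # back to step 514
    return result


status = monitor(["a", "b", "c", "d", "e"], responsive={"e"})
print(status)
```

In this simulated run four failed nodes are outstanding concurrently, yet the responsive node "e" is confirmed after about four seconds and does not have to wait for an earlier node to be declared down.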



Claims 



1. A method for monitoring nodes in a network having
at least one network management station and a
plurality of nodes, characterised by the steps of:-

providing a queue (10) of polling messages for
transmission to the nodes, each polling message
being indexed by the network address of
one of the nodes;
sending said queued polling messages from
the network management station to the plurality
of nodes at a predetermined rate;
recording transmission of the polling messages
in a table having a first portion indexed by the
network address of each node and a second
portion indexed by a timeout associated with a
polling message count for each node having an
outstanding status poll; and
determining for each node if that node has
failed after a predetermined number of polling
messages have been sent to that node.

2. A method according to claim 1, wherein said step
of determining if a node has failed is characterized
by the steps of:-

determining if the count of polling messages
sent to that node has reached the predetermined
number; and
determining if an elapsed timeout period for that
particular polling message count has expired,
such that when the polling messages sent to a
node are unacknowledged and the polling message
count reaches the predetermined number
and the timeout has expired the node is
determined to have failed.

3. A method according to claim 2, characterized in that
the elapsed timeout period is about 2.5 minutes.

4. A method according to any preceding claim,
characterized by the steps of:-

deleting the network address of a node from
said queue (10) after a polling message is
transmitted to that node; and
if that polling message is unacknowledged and
the node was determined not to have failed
then adding the network address of that node
to said queue so that another polling message
is sent to that node.

5. A method according to any preceding claim,
characterized in that the polling message is an
internet control message protocol polling message;
the network address is an internet protocol address;
and said predetermined number of polling messages
is four.

6. A system for managing a network comprising at
least one network management station, and a
plurality of nodes connected to the network
management station for data communications there
between, characterized in that each said network
management station includes a queue (10) of polling
messages for transmission to the nodes, and a
poll table having a first portion indexed by a network
address of each node and a second portion indexed
by a timeout associated with a polling message
count; and wherein said network management
station determines for each node if that node has failed
after a predetermined number of polling messages
have been sent to that node within an elapsed
timeout period.

7. A system according to claim 6, characterized in that
said network management station has a rate control
mechanism (12) for controlling the rate at which
polling messages are transmitted.